Abstract

Although several researchers have integrated methods for reinforcement
learning (RL) with case-based reasoning (CBR) to model continuous action
spaces, existing integrations typically employ discrete approximations
of these models. This limits the set of actions that can be modeled, and
may lead to non-optimal solutions. We introduce the Continuous Action
and State Space Learner (CASSL), an integrated RL/CBR algorithm that
uses continuous models directly. Our empirical study shows that CASSL
significantly outperforms two baseline approaches for selecting actions
on a task from a real-time strategy gaming environment.

1.  Introduction 

Real-time strategy (RTS) games are a popular recent focus of attention
for AI research (Buro 2003), and competitions now exist for testing
intelligent agents in these environments (e.g., AIIDE 2007; NIPS 2008).
RTS environments are usually partially observable, sequential, dynamic,
continuous, and involve multiple agents (Russell and Norvig 2003).
Typically, each player controls a team of units that can gather
resources, build structures, learn technologies, and conduct simulated
warfare, where the usual goal is to destroy opponent units. Popular RTS
environments for intelligent agent research include Wargus (Ponsen et
al. 2005), ORTS (Buro 2002), and MadRTS, a game developed by Mad Doc
Software, LLC.

A main attraction of RTS environments is that they can be used to define
and provide feedback for challenging real-time control tasks (e.g.,
controlling single units, winning an entire game) characterized by
large, continuous action and state spaces. However, the vast majority of
intelligent agent research with RTS environments relies on discretizing
these spaces (see §2.2). This process biases the learner, and may render
optimal actions inaccessible.

In this paper, we describe CASSL (Continuous Action and State Space
Learner), an algorithm that integrates case-based reasoning (CBR) and
reinforcement learning (RL) methods that do not discretize these spaces.
We demonstrate and analyze CASSL’s utility in the context of a task
defined in MadRTS. Although previous research exists on continuous
action and state spaces, as well as on CBR/RL integrations, we believe
ours is unique in how it generates actions from stored experience.

Section 2 includes a brief introduction to the real-time strategy
environment we use along with a summary of related research. In Section
3, we introduce CASSL and explain how it extends our previous research.
Section 4 describes an evaluation of its utility in comparison with
simpler approaches. Finally, in Section 5 we provide greater context for
interpreting these results.

2.  Background:  Domain  and  Related  Work 

2.1  Real-Time  Strategy  Domain 

In this paper, we focus on performance tasks whose space of possible
actions A and state space S are multi-dimensional and continuous. We use
MadRTS, whose engine is also used in the Empire Earth II™ game, for our
evaluation. We also considered using Wargus, but chose to use MadRTS
because it is more reliable and supports scenarios with higher military
relevance.

We created MadRTS scenarios to test the capabilities of intelligent
agents for controlling a set of units using a feature-vector
representation. Figure 1 shows a snapshot of one of these scenarios, in
which the units to be controlled are the soldier units in the lower left
corner. Their task is to eliminate the opposing units in the scenario,
which are located in the top left and lower right corners. An action in
this space corresponds to an order given by the agent to a group of
units. Each order directs the soldiers to travel along a vector starting
at their current position, attacking any opponent units they encounter
after completing this movement. The lengths of these movements are
variable, so some actions have longer durations than others. We evaluate
an agent based on how many orders it gives, not how much time it
requires to complete a task.

The  four  dimensions  of  the  continuous  action  space  are: 

-  Heading ∈ [0°,360°], where 0° is the heading from the original
   midpoints of player and opponent soldiers

-  Distance ∈ [0,d], where d is the longest traversable distance in the
   scenario

-  Group size ∈ [0,g], where g is the number of controllable units in
   the scenario

-  Group selection ∈ {all, strongest, leastRecent}, where the values
   indicate the method used to select a group

Figure 1: The MadRTS State Space

The state space consists of eight features, which are defined relative
to the midpoint of the player’s units:

-  Percentage of player’s initial soldiers still alive

-  Percentage of opponent’s initial soldiers still alive

-  Percentage of territories owned by the player

-  Heading to nearest opponent soldier ∈ [0°,360°]

-  Heading to midpoint of opponent soldiers ∈ [0°,360°]

-  Distance to midpoint of enemy soldiers ∈ [0,d]

-  Dispersal of opponent soldiers ∈ [0,d], defined as the median of
   distances from each opponent soldier to the midpoint of all opponent
   soldiers

-  Dispersal of player soldiers ∈ [0,d]
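
The geometric features in this list can be computed directly from unit
positions. The Python sketch below is illustrative only: it assumes
units are (x, y) tuples, takes the midpoint to be the coordinate-wise
mean, and measures headings counter-clockwise from the positive x-axis.
None of these conventions is stated in the text, and the function names
are hypothetical.

```python
import math
from statistics import median

def midpoint(positions):
    """Coordinate-wise mean of (x, y) positions (assumed midpoint definition)."""
    n = len(positions)
    return (sum(p[0] for p in positions) / n,
            sum(p[1] for p in positions) / n)

def dispersal(positions):
    """Median distance from each unit to the midpoint of all units."""
    mx, my = midpoint(positions)
    return median(math.hypot(x - mx, y - my) for x, y in positions)

def heading_deg(origin, target):
    """Heading from origin to target in degrees, normalized to [0, 360)."""
    angle = math.atan2(target[1] - origin[1], target[0] - origin[0])
    return math.degrees(angle) % 360.0

# Opponent layout loosely based on the evaluation scenario: three units
# near one corner and a single unit near the opposite corner.
opponents = [(2, 98), (2, 98), (2, 98), (98, 2)]
opp_mid = midpoint(opponents)
opp_dispersal = dispersal(opponents)
```

Because the dispersal is a median rather than a mean, the single distant
unit in this layout does not inflate the feature.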

At each time point, the agent receives a state vector with the values of
these features and selects an action to execute. For example, in Figure
1, the player units are at their initial location in the lower left
corner, and the opponents are in the upper left and lower right corners.
For both sides, the percentage of original soldiers remaining is 100%
and the percentage of territories (borders shown by dotted lines) owned
is 25%. The heading to the nearest opponent soldier is shown by
θ_oppnear. The heading to the midpoint of opponent soldiers is θ_midopp
and the distance to their midpoint is shown as a line from mid_player to
mid_opp. The dispersal of opponent soldiers is labeled disp_opp, while
the (small) dispersal of player soldiers is not shown.

2.2  Related  Work 

Several techniques for decision making have been tested in RTS
environments, including relational Markov decision processes (Guestrin
et al. 2003), integrated scheduling and means-ends analysis planning
(Chan et al. 2007), and simulation combined with Nash equilibrium
approximation (Sailer, Buro, and Lanctot 2007).

CBR and RL approaches have also been investigated separately for RTS
planning tasks. For example, CBR techniques have been designed to select
action sequences (Aha, Molineaux, and Ponsen 2005) and to construct
plans from behavioral cases extracted from and annotated by human
players (Ontañón et al. 2007). RL techniques have been used to select
action sequences (Ponsen et al. 2006), choices defined in partial
programs (Marthi et al. 2005), and challenge-sensitive actions (Andrade
et al. 2005). Unlike these previous methods, CASSL does not discretize
the action space, and uses an integrated CBR/RL approach to select
primitive actions in an RTS task.

Previous approaches for learning in the context of continuous action
spaces have been investigated separately for CBR and RL methods. Among
example CBR approaches, Aha and Salzberg (1994) examined a set of
supervised learning approaches for a ball-catching task. Their
algorithms restricted action selection to those actions that had been
previously recorded, which limits the set of actions that can be
selected for a new state. Sheppard and Salzberg (1997) describe a lazy
Q-learning approach to action selection for a missile avoidance task:
the set of states whose distance is within a threshold is located, and
among those the action with the highest Q value is selected. CASSL
instead applies a quadratic model to the nearest neighbors and selects
an action corresponding to its maximum. This differs from locally
weighted regression (Atkeson, Moore, and Schaal 1997), which computes a
local linear model from a query’s neighboring cases.

Traditionally, RL methods have used eager learning methods to help
select actions from continuous action spaces (Kaelbling, Littman, and
Moore 1996, §6.2). For example, these include training a neural network
with state-action input pairs and Q value outputs, and then applying
gradient descent to locate actions with high Q values. Alternatively,
this network could be used with an active learning process to test
actions generated according to a distribution whose mean and variance
were varied so as to find a local maximum. Gaskett et al. (1999)
describe an eager approach that performs interpolation with a neural
network’s outputs. They also survey continuous action Q-learning systems
and note that most are eager and yield piecewise-constant functions. In
contrast, our approach uses a lazy method for action selection and is
not restricted to piecewise-constant action-selection functions.

Takahashi et al. (1999) instead tessellate a continuous action space in
their Q-learning extension. CASSL does not rely on decomposing the
action space. Millan et al. (2002) investigate a Q-learner that explores
a continuous action space by leveraging the Q-values of neighboring,
previously-explored actions. However, this limits action selection to
the set of previously explored actions. Buck et al. (2002) heuristically
select a set of actions that are distributed across the action space and
select the one corresponding to the maximum-valued successor state.
CASSL instead selects actions used in neighbor states, dynamically forms
a quadratic model from them, and selects the action that yields a
maximum value according to this model.

Sharma et al. (2007) integrated CBR and RL techniques in CARL, a
hierarchical architecture that uses an instance-based state function
approximator for its reinforcement learner and RL to revise case
utilities. They also investigated its application to scenarios defined
using MadRTS. However, CARL’s action space is discrete, whereas our
contribution is an integrated method for reasoning with continuous
action spaces. Santamaria et al. (1997) also examined integrated CBR/RL
approaches that operate on continuous action spaces and applied them to
non-adversarial numeric control tasks. For example, this included a CMAC
(Albus 1975) approach for Q-learning that discretizes the set of
possible actions and selects the action with the highest Q-value. In
contrast, CASSL dynamically optimizes a continuous local model of the
action-value space, which allows access to all potential actions without
requiring a search over all of them.

Finally, unlike other Q-learning extensions that select from among the
actions in the (state) neighbors to a query, Hedger (Smart & Kaelbling
2000) fits a quadratic surface and selects an action that maximizes it.
While CASSL also calculates a regression surface, it is based on the
value of states that would occur if the state changed according to
trajectories observed in the past. Although these past trajectories may
be inaccurate for the current state, the values predicted are influenced
less by nearby cases, and provide a more diverse basis for the
regression surface.

3.  Continuous  Action  and  Space  Learning 

We now describe CASSL (Continuous Action and State Space Learner), which
integrates case-based and reinforcement learning methods to act in an
environment with continuous states and actions. CASSL leverages our
experiences with CaT (Aha et al. 2005), which uses CBR techniques (but
not RL) to control groups in Wargus (Ponsen et al. 2005), a dynamic,
non-episodic, and nearly deterministic RTS environment. CaT’s control
decisions focus on tactic selection, where tactics are comparatively
long sequences of primitive actions lasting a significant fraction of a
trial.

3.1  Motivation  for  this  Integrated  Approach 

CaT has two limitations that CASSL addresses. First, CaT was designed
for an abstract action space (i.e., it selects from among a small set of
pre-defined tactics) and required a large state-space taxonomy; it was
not designed for a knowledge-poor continuous action domain or, more
generally, domains that have a large number of primitive actions.
Techniques that make decisions of smaller granularity may permit greater
control, and eliminate the need for creating tactics in advance. Greater
control may also increase task performance and reduce dependence on an
external source of tactics. For example, suppose CaT’s opponent tries to
gain an advantage via early use of air units. If none of CaT’s tactics
can create air units early on, it will probably lose often. With direct
access to primitive actions that create new units, CASSL is not prone to
this problem.

Second, CaT cannot reason about causal relations among states, which can
be used to improve credit assignment. Standard RL techniques for
representing value functions and action-value functions can represent
these relations (Sutton and Barto 1998), which motivates our
investigation of an integrated CBR/RL approach in this paper. For
example, if CaT tends to pick a poor-performing tactic subsequent to a
good tactic, then it would average the performance across all successor
tactics. In contrast, CASSL uses a sample backup procedure that can more
quickly improve the accuracy of performance approximations.


3.2  CASSL  Algorithm 

CASSL is a case-based reasoner that responds to each time step of a game
trial by executing a function LearnAct, which updates CASSL’s case bases
and returns a new action to be performed. LearnAct inputs a prior state
s_{i-1} ∈ S, an action a_{i-1} ∈ A which was taken in state s_{i-1}, the
state s_i ∈ S which resulted from applying a_{i-1} in s_{i-1}, and a
reward r_{i-1} ∈ ℝ. It outputs a recommended action a_i ∈ A. States in S
and actions in A are represented as real-valued feature vectors.

    T: Transition case base <S × A × ΔS>
    V: Value case base <S × ℝ>

    LearnAct(s_{i-1}, a_{i-1}, s_i, r_{i-1}) =
      T ← retain(T, s_{i-1}, a_{i-1}, s_i − s_{i-1})
            ; Update transition case base
      V ← retainRevise(V, s_{i-1}, retrieve(V, s_i))
            ; Update value case base
      C ← retrieve(T, s_i)
            ; Retrieve similar transition cases
      M ← ∅
            ; Initialize the map of actions to values
      ∀c ∈ C: M ← M ∪ <c.a, retrieve(V, s_i + c.Δs)>
            ; Populate it for retrieved cases’ actions
      a_i ← argmax_{a ∈ A} reuse(M, a)
            ; Fetch action w/ max predicted reward
            ; using the Nelder-Mead simplex method
      return a_i

Figure 2: CASSL’s learning and action selection function

Figure 2 details CASSL’s LearnAct function. It references two case
bases, which are updated and queried during an episode. The first is the
transition case base T: S × A × ΔS, which models the effects of applying
actions. T contains observed state transitions that CASSL uses to help
predict future state transitions. These have the form:

    c_T = <s, a, Δs>

The second case base is the value case base V: S × ℝ, which models the
value of a state. It contains estimates of the sum of rewards that would
be achieved by CASSL starting in a state s and continuing to the end of
a trial using its current policy. Value cases have the form:

    c_V = <s, v>

Each of CASSL’s two case bases supports a case-based problem solving
process consisting of a cycle of case retrieval, reuse, revision, and
retention (Aamodt and Plaza 1994). These cycles are closely integrated
because a solution to a problem in T forms a problem in V; CASSL solves
these problems in tandem to select an action.

At the start of a trial each of CASSL’s case bases is initialized to the
empty set. CASSL retains new cases and revises them through its
application to a sequence of game-playing episodes. For each new state
s_i that arises during an episode, LearnAct is called with its four
arguments.

LearnAct begins with a case retention step in T; if an experience occurs
that is not correctly predicted by T, a new case c_{T,i} = <s_{i-1},
a_{i-1}, Δs> is added, where Δs = s_i − s_{i-1} (a vector from the prior
to the current state). Retention is controlled by two parameters, τ_T
and σ_T (not shown in Figure 2); c_{T,i} is retained if either the
distance d_T(c_{T,i}, 1NN(T, c_{T,i})) between c_{T,i} and its nearest
neighbor in T is less than τ_T, or if the distance d(c_{T,i}.Δs,
T(s_{i-1}, a_{i-1})) between the actual and the estimated transitions is
greater than the maximum error permitted, σ_T. Transition cases are
never revised, under the assumption of a deterministic environment.

Figure 3: CASSL’s algorithm for action recommendation, where S_x and S_y
are hypothetical state features

The second line in LearnAct performs conditional case retention and
revision for V. A new case c_{V,i} is added to V only if the state
distance d_V(c_{V,i}, 1NN(V, c_{V,i})) to its nearest neighbor in V is
greater than τ_V (not shown in Figure 2). New cases are initialized
using the discounted return (Sutton and Barto 1998):

    $$v_{i-1} = \sum_{k=i}^{\infty} \gamma^{\,k-i} \, r_{k-1}$$

Otherwise, c_{V,i}’s k-nearest neighbors are revised to better
approximate the actual value of this region of the state space. The
state value v_k associated with each nearest neighbor c_{V,k} ∈ kNN(V,
c_{V,i}) is revised according to its contribution to the error in
estimating the value of s_{i-1}. In this revision, v_k is the value
associated with neighbor c_{V,k}, α (0 ≤ α ≤ 1) is the learning rate,
γ (0 ≤ γ ≤ 1) is a geometric discount factor, and the Gaussian kernel
function K(d) = exp(−d²) determines the relative contributions of the
k-nearest neighbors.
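
One kernel-weighted revision that is consistent with the symbols defined
above (learning rate α, discount γ, kernel K) can be sketched in Python
as follows. This is an assumption, not CASSL’s exact rule: it uses a
one-step temporal-difference error and splits it among the neighbors in
proportion to their kernel weights. Value cases are modeled as mutable
[state, value] lists, and all function names are hypothetical.

```python
import math

def gaussian_kernel(d):
    """K(d) = exp(-d^2), as defined in the text."""
    return math.exp(-d * d)

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def k_nearest(value_cases, state, k):
    """The k value cases [state, value] whose states are closest to `state`."""
    return sorted(value_cases, key=lambda c: euclidean(c[0], state))[:k]

def estimate_value(value_cases, state, k):
    """Kernel-weighted average of the k nearest stored values (assumed reuse)."""
    neighbors = k_nearest(value_cases, state, k)
    weights = [gaussian_kernel(euclidean(c[0], state)) for c in neighbors]
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return sum(w * c[1] for w, c in zip(weights, neighbors)) / total

def revise_values(value_cases, prev_state, reward, next_state, k, alpha, gamma):
    """Assumed TD-style revision of the neighbors of prev_state, in place."""
    target = reward + gamma * estimate_value(value_cases, next_state, k)
    error = target - estimate_value(value_cases, prev_state, k)
    neighbors = k_nearest(value_cases, prev_state, k)
    weights = [gaussian_kernel(euclidean(c[0], prev_state)) for c in neighbors]
    total = sum(weights)
    if total == 0.0:
        return
    for w, case in zip(weights, neighbors):
        # Each neighbor absorbs a share of the error proportional to K(d).
        case[1] += alpha * error * w / total

cases = [[(0.0,), 0.0], [(1.0,), 10.0]]
revise_values(cases, (0.0,), -1.0, (1.0,), k=1, alpha=0.5, gamma=1.0)
```

In this toy call the target is −1 + 10 = 9 and the prior estimate is 0,
so the single neighbor of the prior state moves halfway to the target.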

Figure 3 summarizes the remaining (action-recommendation) steps of
LearnAct, which next retrieves C, the set of transition cases in T whose
states are similar to s_i. This identifies states that are reachable
from the current state, and actions for transitioning to them (see step
1 in Figure 3). CASSL uses a simple k-nearest neighbor algorithm on
states for case retrieval. However, we set k to be large so as to
retrieve enough information for the later regression step to succeed.
Specifically, k is set as a function of |a|, the size of an action
vector.

Next, for each nearest neighbor c_{T,k} = <s_k, a_k, Δs_k>, CASSL
computes the predicted next state that results from applying a_k in
state s_i, thus creating a mapping M from actions to the value of the
expected resulting state. This value is calculated by performing the
vector addition Δs_k + s_i, which yields the predicted state s_{i+1}.
Then V is reused to calculate the expected value of state s_{i+1} (step
3 in Figure 3). Retrieval and reuse are performed in the same fashion as
described for the step that updates V.
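
The construction of the action-to-value mapping M can be sketched in
Python as below. This is a minimal illustration under stated
assumptions: transition cases are dictionaries with hypothetical keys
"s", "a", and "ds", retrieval is plain Euclidean k-nearest neighbor, and
the value case base is abstracted as a caller-supplied function
value_of.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def retrieve_transitions(T, state, k):
    """The k transition cases whose stored states are nearest to `state`."""
    return sorted(T, key=lambda c: euclidean(c["s"], state))[:k]

def build_action_value_map(T, value_of, state, k):
    """Pair each retrieved action with the value of its predicted next state."""
    M = []
    for case in retrieve_transitions(T, state, k):
        # Predicted next state: current state plus the stored state change.
        predicted = tuple(si + di for si, di in zip(state, case["ds"]))
        M.append((case["a"], value_of(predicted)))
    return M

T = [
    {"s": (0.0, 0.0), "a": (1.0,), "ds": (1.0, 0.0)},
    {"s": (5.0, 5.0), "a": (2.0,), "ds": (0.0, 1.0)},
]
toy_value = lambda s: -abs(s[0] - 1.0) - abs(s[1])  # stand-in for V
M = build_action_value_map(T, toy_value, (0.0, 0.0), k=2)
```

Note that the stored change Δs is applied to the current state rather
than to the state recorded in the case, which is what lets a transition
observed elsewhere inform the current decision.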

CASSL then creates a multi-dimensional model of this action-value map
using quadratic regression (step 4 in Figure 3), which is necessary due
to the continuous nature of the state and action spaces. We chose
quadratic regression because a quadratic function often produces a
useful peak that is not at a point in the basis mapping, thereby
encouraging exploration. Higher orders of regression may also produce
such results, but are more computationally expensive, and we would like
to produce a result in real time.
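
As an illustration of the regression step, the following Python sketch
fits a full quadratic (constant, linear, and all pairwise product terms)
to (action, value) pairs by ordinary least squares. It is not the
authors’ implementation; the normal-equation solver and all function
names are assumptions chosen to keep the example dependency-free.

```python
from itertools import combinations_with_replacement

def quadratic_features(a):
    """[1, a_1..a_n, a_i*a_j for i <= j] for an action vector a."""
    feats = [1.0] + [float(x) for x in a]
    pairs = combinations_with_replacement(range(len(a)), 2)
    feats += [a[i] * a[j] for i, j in pairs]
    return feats

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system: too few distinct actions")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def fit_quadratic(pairs):
    """Least-squares weights from the normal equations X^T X w = X^T y."""
    X = [quadratic_features(a) for a, _ in pairs]
    y = [v for _, v in pairs]
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve_linear(XtX, Xty)

def predict(w, a):
    return sum(wi * fi for wi, fi in zip(w, quadratic_features(a)))

# Four sampled actions whose values lie on v = 5 - (a - 2)^2.
samples = [((0.0,), 1.0), ((1.0,), 4.0), ((3.0,), 4.0), ((4.0,), 1.0)]
w = fit_quadratic(samples)
```

In this toy data set the fitted surface peaks at a = 2, an action that
was never sampled, which is the property the paragraph above appeals to.
For a 4-dimensional action the model has 15 coefficients, so at least
that many distinct retrieved actions are needed for a unique fit.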

The final step locates the action that maximizes this model, and adds it
to M. To compute this, we use Flanagan’s (2007) implementation of the
Nelder-Mead simplex method, a well-known method for finding a maximum
value of a general n-dimensional function.
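
For readers unfamiliar with the simplex method, a simplified Python
version is sketched below. It is not Flanagan’s implementation: it uses
only reflection, expansion, inside contraction, and shrink steps, stops
when the simplex is small, and maximizes g by minimizing −g. The step
size, tolerance, and function name are assumptions.

```python
def nelder_mead_maximize(g, x0, step=0.5, tol=1e-8, max_iter=1000):
    """Maximize g with a simplified Nelder-Mead search on f = -g."""
    f = lambda x: -g(x)
    n = len(x0)
    # Initial simplex: x0 plus one point offset along each axis.
    simplex = [list(x0)]
    for i in range(n):
        point = list(x0)
        point[i] += step
        simplex.append(point)
    for _ in range(max_iter):
        simplex.sort(key=f)  # best vertex first, worst last
        best, worst = simplex[0], simplex[-1]
        size = max(abs(p[i] - best[i]) for p in simplex[1:] for i in range(n))
        if size < tol:
            break
        # Centroid of every vertex except the worst one.
        centroid = [sum(p[i] for p in simplex[:-1]) / n for i in range(n)]
        reflected = [c + (c - w) for c, w in zip(centroid, worst)]
        if f(reflected) < f(best):
            expanded = [c + 2.0 * (c - w) for c, w in zip(centroid, worst)]
            simplex[-1] = expanded if f(expanded) < f(reflected) else reflected
        elif f(reflected) < f(simplex[-2]):
            simplex[-1] = reflected
        else:
            contracted = [c + 0.5 * (w - c) for c, w in zip(centroid, worst)]
            if f(contracted) < f(worst):
                simplex[-1] = contracted
            else:
                # Shrink every other vertex halfway toward the best vertex.
                simplex = [best] + [
                    [b + 0.5 * (pi - b) for b, pi in zip(best, p)]
                    for p in simplex[1:]
                ]
    return tuple(min(simplex, key=f))

# A concave quadratic surface with its peak at (2, -1).
peak = nelder_mead_maximize(
    lambda a: 5.0 - (a[0] - 2.0) ** 2 - (a[1] + 1.0) ** 2, (0.0, 0.0)
)
```

The method needs only function evaluations, not gradients, which makes
it convenient for optimizing a freshly fitted regression surface.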

The quadratic estimate of the value of the discovered action is less
accurate than a case-based prediction. Thus, we iteratively re-create
the model, incorporating more accurate predictions, by repeating Steps 4
and 5 (Figure 3) until a similar action is found on two successive
iterations, or until 50 iterations have passed. Similarity between
successive actions is defined as a Euclidean distance less than a small
threshold value; we use 0.0001 as the threshold.
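
The stopping rule for this refinement loop can be sketched in Python as
follows. The regression, maximization, and case-based prediction steps
are passed in as hypothetical callables (fit, maximize, case_value), so
the sketch shows only the control flow: the 50-iteration cap and the
0.0001 distance threshold come from the text, while the interfaces are
assumptions.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def refine_action(M, fit, maximize, case_value, threshold=1e-4, max_iters=50):
    """Repeat fit/maximize until two successive actions nearly coincide."""
    previous = None
    for _ in range(max_iters):
        model = fit(M)            # step 4: regress values on actions
        action = maximize(model)  # step 5: locate the model's maximum
        if previous is not None and euclidean(action, previous) < threshold:
            return action
        # Add the discovered action with a case-based value estimate, so
        # the next fit relies less on the quadratic model's own guess.
        M.append((action, case_value(action)))
        previous = action
    return previous
```

Each pass adds one (action, value) pair to M, so the loop performs at
most 50 extra value predictions per decision.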

4.  Evaluation 

Our empirical study focuses on analyzing whether CASSL’s continuous
action model significantly outperforms a similar algorithm that instead
employs a discrete action model on a task defined in MadRTS. As an
experimentation platform, we used TIELT (2007), the Testbed for
Integrating and Evaluating Learning. TIELT is a free tool that can be
used to evaluate the performance of an agent on tasks in an integrated
simulation environment. TIELT managed communication between MadRTS and
the agents we tested, ran the experiment protocol, and collected
results.

We assessed performance in terms of a variant of regret (Kaelbling et
al. 1996) that calculates the difference between the performances of two
algorithms over time as a percentage of optimal performance. The domain
metric measured is the number of steps required to complete the task. As
described in Section 2.1, each step corresponds to an order given to a
group of units. After 200 steps, a trial is cut off, so a value of 200
corresponds to failure.

We compared the performance of CASSL versus two baseline algorithms. The
first is random, which at each time step selects an action randomly from
a uniform distribution over the 4-dimensional action space.
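
A uniform sampler over the four action dimensions of Section 2.1 might
look like the Python sketch below. The dictionary layout, the function
name, and the treatment of group size as a real number are assumptions;
the text does not say how the random agent encodes its actions.

```python
import random

GROUP_SELECTION = ("all", "strongest", "leastRecent")

def random_action(rng, max_distance, max_group_size):
    """Draw one action uniformly from the 4-dimensional action space."""
    return {
        "heading": rng.uniform(0.0, 360.0),            # degrees
        "distance": rng.uniform(0.0, max_distance),    # in [0, d]
        "group_size": rng.uniform(0.0, max_group_size),  # in [0, g]
        "group_selection": rng.choice(GROUP_SELECTION),
    }

rng = random.Random(0)  # seeded only to make the example repeatable
action = random_action(rng, max_distance=141.4, max_group_size=3)
```

The value 141.4 is only a placeholder for d (roughly the diagonal of a
100 x 100 map); the scenario’s actual longest traversable distance is
not given.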

The second algorithm is a CMAC controller (Albus 1975), a commonly used
algorithm for performing RL tasks in continuous state spaces. It uses a
set of overlapping tilings of the state-action space to approximate the
RL Q(s,a) function. It executes a query by averaging the value of the
tile in each tiling that corresponds to the state-action input. For this
experiment, we used five tilings, evenly offset from one another. There
are 4 tiles per dimension and 12 dimensions in S × A, which yields a
tiling size of 4^12 and a total of 5 × 4^12 ≈ 83.9M tiles. The structure
and basic operations of our CMAC are similar to those described in
(Santamaria et al., 1998) with λ=0.9.
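
The tile count and the averaging query can be checked with a small
Python sketch. This is a simplified CMAC under stated assumptions:
inputs are pre-scaled to [0, 1], offset tilings are clamped at the upper
edge rather than extended by an extra tile, weights are stored sparsely,
the update rule is a plain error correction, and eligibility traces (λ)
are omitted. The class and method names are hypothetical.

```python
class CMAC:
    def __init__(self, n_tilings=5, tiles_per_dim=4, n_dims=12):
        self.n_tilings = n_tilings
        self.tiles_per_dim = tiles_per_dim
        self.n_dims = n_dims
        self.weights = {}  # sparse map: (tiling, tile coordinates) -> value

    def total_tiles(self):
        """Number of tiles across all tilings."""
        return self.n_tilings * self.tiles_per_dim ** self.n_dims

    def _active_tiles(self, x):
        """One tile per tiling for an input x scaled to [0, 1]."""
        keys = []
        for t in range(self.n_tilings):
            # Tilings are evenly offset by a fraction of one tile width.
            offset = t / (self.n_tilings * self.tiles_per_dim)
            coords = tuple(
                min(int((xi + offset) * self.tiles_per_dim),
                    self.tiles_per_dim - 1)
                for xi in x
            )
            keys.append((t, coords))
        return keys

    def query(self, x):
        """Average the values of the active tile in each tiling."""
        keys = self._active_tiles(x)
        return sum(self.weights.get(k, 0.0) for k in keys) / len(keys)

    def update(self, x, target, alpha):
        """Move every active tile toward the target (assumed update rule)."""
        error = target - self.query(x)
        for k in self._active_tiles(x):
            self.weights[k] = self.weights.get(k, 0.0) + alpha * error

cmac = CMAC()
n_tiles = cmac.total_tiles()
```

With five tilings of 4^12 tiles each, n_tiles is 83,886,080, matching
the figure of about 83.9M quoted above; a sparse dictionary avoids
allocating all of them up front.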

For both RL algorithms (CASSL and CMAC), we set α=0.2, γ=1.0, and ε=0.5
(exploration parameter). Both α and ε were decreased asymptotically to 0
over time. For CASSL, we also set λ=0, k=21, τ_T=0.8, τ_V=0.05, and
σ_T=0.2. We briefly conducted a manual parameter tuning process to
obtain reasonable performances from both algorithms, but did not attempt
to optimize their settings.

Figure 4: Learning performance

The MadRTS scenario used for this evaluation has a size of 100 x 100
tiles, each covered with flat terrain. In the starting position, 3 “U.S.
Rifleman” (powerful) units controlled by player 1 are clustered around
tile <20,22>, 3 “Insurgent5_AK47” (less powerful) units controlled by
player 2 are clustered about <2,98>, and 1 “Insurgent1_AK47” (powerful)
unit controlled by player 2 is at <98,2>. The victory condition is set
to a value of “conquest”, and diplomacy between players 1 and 2 is set
to “hostile”. At these settings, the opponent will attempt to hold his
ground and destroy all hostile units that enter visual range. All other
settings have their default values.

We ran each agent for 10 replications, each on 1000 training trials, and
tested on 5 trials after every 25 training trials. We report the average
testing results. Although each agent learned on-line within a testing
trial, its memory was recorded beforehand and reset after each test. To
ensure that trials ended in a reasonable amount of time, we cut off any
that did not complete after 200 time steps; no reward was assigned for
the final action of a cutoff trial. A reward of -1 was given at each
step unless the agent accomplished its goal (reward=1000) by eradicating
the opponent’s units.

Figure 5: Early learning performance
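
The reward scheme can be made concrete with a short Python sketch. Two
points are interpretations rather than statements from the text: a trial
that has not finished before step 200 is treated as a cutoff, and "no
reward" for the final action of a cutoff trial is modeled as a reward of
0. The function names are hypothetical.

```python
def trial_rewards(steps_to_goal, cutoff=200, step_reward=-1, goal_reward=1000):
    """Reward sequence for one trial; steps_to_goal is None if never solved."""
    if steps_to_goal is not None and steps_to_goal < cutoff:
        # -1 for every order except the last, which reaches the goal.
        return [step_reward] * (steps_to_goal - 1) + [goal_reward]
    # Cutoff trial: the final action receives no reward (modeled as 0).
    return [step_reward] * (cutoff - 1) + [0]

def discounted_return(rewards, gamma=1.0):
    """Sum of gamma^t * r_t; gamma = 1.0 matches the experimental setting."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total

solved = discounted_return(trial_rewards(10))     # goal reached on order 10
failed = discounted_return(trial_rewards(None))   # trial cut off
```

Under these assumptions a 10-order success yields a return of 991 and a
cutoff trial yields -199, so shorter solutions are always preferred.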

Figure 4 displays the results. The curves shown here are monotonically
non-increasing because we report the minimum steps taken (per algorithm)
on any trial so far in a replication and average over these curves. This
measurement is reasonable because prior testing performances can be
repeated by restoring the state of the learner; it is more forgiving to
algorithms that do not guarantee that learning will never decrease
performance.
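
The best-so-far curves can be computed as a running minimum per
replication followed by a pointwise average, as in this Python sketch.
The function names and the toy step counts are illustrative and are not
data from the experiment.

```python
from itertools import accumulate

def best_so_far(steps_per_test):
    """Running minimum of steps taken, one value per test point."""
    return list(accumulate(steps_per_test, min))

def average_curves(replications):
    """Pointwise mean of the best-so-far curves of several replications."""
    curves = [best_so_far(r) for r in replications]
    return [sum(column) / len(column) for column in zip(*curves)]

# Two made-up replications with three test points each.
curve = average_curves([[200, 150, 180], [200, 200, 100]])
```

Each individual curve is non-increasing by construction, so their
average is as well.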

The regret of CMAC compared to CASSL is 3.53, which is statistically
significant (p=0.001). Thus, CASSL, using its best learned behavior so
far, is 3.53% closer to optimal performance. Comparing CASSL to the
random agent, the regret is 38.66, which is again significant
(p < 0.001).

Figure 5 compares the early learning performance of CMAC and CASSL up to
200 trials. This period is particularly interesting because it shows
that CASSL learns to do well earlier than CMAC. The regret during this
period (0-200 training trials) is 9.74 with p=0.017.

5.  Discussion 

Our goal was to demonstrate that selecting from among all possible
continuous actions rather than a priori reducing their set (e.g., via
discretization) can significantly improve performance. However, we
assessed this on only a single scenario, and versus only two other
algorithms. In future work, we will compare CASSL’s performance,
empirically and via a computational complexity analysis, with other
algorithms that can process continuous action spaces over a range of
learning and performance tasks. This will include variants of CASSL that
discretize the action space.

Other models for regression of the local action-value function (e.g.,
some higher-order polynomial or other function entirely) might
outperform the model we used. Also, a model-free variant of CASSL in
which the action-value function is represented directly should be
studied. The two case bases should scale up to higher dimensions more
easily, but we have not empirically verified this.

We have not optimized CASSL's performance (e.g., employing more
selective methods for using neighbors to create action recommendations).
This remains future work.

RTS domains often involve a variety of similar tasks with different
initial conditions and varied goals. For example, a larger group of
units might need to be destroyed at a variety of locations both near and
far from the agent’s home base. We plan to analyze the capability of
CASSL and other RL agents to generalize over different goals and
starting conditions in an RTS domain.

6.  Conclusions 

We introduced a methodology that, unlike our earlier approach (Aha et
al. 2005), can learn and reason with continuous action spaces. To do
this it integrates case-based reasoning and reinforcement learning
methods, and its implementation in CASSL significantly outperformed two
baseline approaches on a real-time strategy gaming task.

The primary contribution of this paper was a lazy learning approach for
action generation in a continuous space. In our future work, we will
compare this approach with variants of CASSL that are eager, that adopt
a Q-learning framework, and/or discretize the action space.